Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms
Abstract
The purpose of this short paper is to share a recent observation I made in the context of my introductory graduate course on MapReduce at the University of Maryland. It is well known that, because the sort/shuffle stage in MapReduce is costly, local aggregation is an important principle in designing efficient algorithms. This typically involves using combiners or the so-called in-mapper combining technique [5]. However, can we be more precise in formulating this design principle for pedagogical purposes? Simply saying "use combiners" or "use in-mapper combining" is unsatisfying because it leaves open the obvious question of how. What follows is my attempt to formulate a more precise design principle in terms of monoids. The idea is quite simple, but I haven't seen anyone else make this observation before in the context of MapReduce. Let me illustrate with a running example I often use to teach MapReduce algorithm design, which is detailed in Lin and Dyer [5]. Given a large number of key-value pairs where the keys are strings and the values are integers, we wish to find the average of all the values by key. In SQL, this is accomplished with a simple GROUP BY and AVG. Here is the naive MapReduce algorithm:
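The paper's own pseudocode is not reproduced on this page. As a rough illustration only, the following Python sketch simulates the per-key average in two ways: a naive version that ships every raw value to the reducer, and a version that aggregates locally using (sum, count) pairs, which form a monoid under component-wise addition with identity (0, 0). The names `SPLITS`, `naive_average`, `combine`, `local_aggregate`, and `monoid_average` are hypothetical, and the in-memory "shuffle" is an assumption standing in for a real MapReduce runtime.

```python
from collections import defaultdict

# Hypothetical toy input: each inner list is the data seen by one mapper.
SPLITS = [
    [("a", 1), ("a", 2), ("b", 10)],
    [("a", 6), ("b", 20)],
]


def naive_average(splits):
    """Naive version: every (key, value) pair crosses the shuffle unchanged."""
    groups = defaultdict(list)
    for split in splits:
        for key, value in split:          # map: emit the pair as-is
            groups[key].append(value)     # shuffle: group values by key
    # reduce: average the full list of raw values for each key
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Monoid on (sum, count) pairs: an associative operation plus an identity.
IDENTITY = (0, 0)


def combine(x, y):
    """Associative, commutative merge of two (sum, count) partials."""
    return (x[0] + y[0], x[1] + y[1])


def local_aggregate(split):
    """In-mapper combining: fold one split into a single partial per key."""
    partial = defaultdict(lambda: IDENTITY)
    for key, value in split:
        partial[key] = combine(partial[key], (value, 1))
    return dict(partial)


def monoid_average(splits):
    """Merge per-split partials, then divide once at the very end."""
    merged = defaultdict(lambda: IDENTITY)
    for split in splits:
        for key, part in local_aggregate(split).items():
            merged[key] = combine(merged[key], part)
    return {key: total / count for key, (total, count) in merged.items()}


if __name__ == "__main__":
    print(naive_average(SPLITS))
    print(monoid_average(SPLITS))
```

Both functions return the same averages, but the second sends only one partial per key per split through the shuffle. The mean itself cannot be used as the intermediate value: for key "a" above, the per-split means are 1.5 and 6, and averaging those gives 3.75 rather than the true mean of 3.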
Journal: CoRR
Volume: abs/1304.7544
Publication date: 2013